Software version control systems contain a huge amount of evolutionary data. It's very common to mine these repositories to gain insight into how the development of a software product works. But the data needs some preprocessing to avoid flawed analyses.

That's why I'll show you how to read the commit information of a Git repository into a Pandas DataFrame!

Idea

The main idea is to use an existing Git library for Python that provides the necessary (and hopefully efficient) access to a Git repository. In this notebook, we'll use GitPython because, at first glance, it seems easy to use and does the things we need.
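If you haven't installed GitPython yet, it's available on PyPI and can be set up with a quick

pip install gitpython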

Our implementation strategy is straightforward: we avoid writing our own loops and functions as much as possible and instead use all the processing power that Pandas delivers. So let's get started.

Create an initial DataFrame

First, we import our two main libraries for analysis: Pandas and GitPython.


In [1]:
import pandas as pd
import git

With GitPython, you can access a Git repository via a Repo object. That's your entry point to the world of Git.

For this notebook, we analyze the Spring PetClinic repository, which can easily be cloned to your local computer with a

git clone https://github.com/spring-projects/spring-petclinic.git

Repo needs at least the path to your Git repository. I've added the additional argument odbt with git.GitCmdObjectDB. With this, GitPython uses a more performant backend for retrieving all the data (see the docs for more details).


In [2]:
repo = git.Repo(r'C:\dev\repos\spring-petclinic', odbt=git.GitCmdObjectDB)
repo


Out[2]:
<git.Repo "C:\dev\repos\spring-petclinic\.git">

To transform the complete repository into a Pandas DataFrame, we simply iterate over all commits of the master branch.


In [3]:
commits = pd.DataFrame(repo.iter_commits('master'), columns=['raw'])
commits.head()


Out[3]:
raw
0 ffa967c94b65a70ea6d3b44275632821838d9fd3
1 fd1c742d4f8d193eb935519909c15302b783cd52
2 f792522b3dffca918f52010c8593999088034e19
3 75912a06c5613a2ea1305ad4d8ad6bc4be7765ce
4 443d35eae23c874ed38305fbe75216339c41beaf
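By the way: if you are experimenting with a much bigger repository, you can limit the amount of data read. iter_commits passes additional arguments like max_count on to git rev-list (just a sketch, not needed for PetClinic's roughly 550 commits):

# read only the 100 most recent commits for quick experiments
sample = pd.DataFrame(repo.iter_commits('master', max_count=100), columns=['raw'])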

Our raw column now contains all the commits as GitPython's Commit objects (to be more accurate: references to these objects). The string representation happens to be the SHA-1 hash of the commit.
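GitPython also exposes the hash directly via the Commit object's hexsha attribute, so you can verify this quickly:

# the string representation of a Commit is its SHA-1 hash
first = commits.loc[0, 'raw']
str(first) == first.hexsha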

Investigate commit data

Let's have a look at the last commit.


In [4]:
last_commit = commits.loc[0, 'raw']
last_commit


Out[4]:
<git.Commit "ffa967c94b65a70ea6d3b44275632821838d9fd3">

Such a Commit object is our entry point for retrieving further data.


In [5]:
print(last_commit.__doc__)


Wraps a git Commit object.

    This class will act lazily on some of its attributes and will query the
    value on demand only if it involves calling the git binary.

It provides all data we need:


In [6]:
last_commit.__slots__


Out[6]:
('tree',
 'author',
 'authored_date',
 'author_tz_offset',
 'committer',
 'committed_date',
 'committer_tz_offset',
 'message',
 'parents',
 'encoding',
 'gpgsig')

E.g., basic data like the commit message:


In [7]:
last_commit.message


Out[7]:
'spring-petclinic-angular1 repo renamed to spring-petclinic-angularjs'

Or the date of the commit:


In [8]:
last_commit.committed_datetime


Out[8]:
datetime.datetime(2017, 4, 12, 21, 41, tzinfo=<git.objects.util.tzoffset object at 0x0000025943CEC198>)

Some information about the author.


In [9]:
last_commit.author.name


Out[9]:
'Antoine Rey'

In [10]:
last_commit.author.email


Out[10]:
'antoine.rey@gmail.com'
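Note that Git distinguishes between the author and the committer of a commit (both appear in the __slots__ listing above); the Commit object exposes the committer in the same way:

last_commit.committer.name
last_commit.committer.email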

Or file statistics about the commit:


In [11]:
last_commit.stats.files


Out[11]:
{'readme.md': {'deletions': 1, 'insertions': 1, 'lines': 2}}
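If you only need aggregated numbers, the Stats object also provides them via its total attribute:

# totals over all files touched by this commit:
# a dict with 'insertions', 'deletions', 'lines' and 'files'
last_commit.stats.total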

Fill the DataFrame with data

Let's check how fast we can retrieve all the authors from the commits' data.


In [12]:
%%time
commits['author'] = commits['raw'].apply(lambda x: x.author.name)
commits.head()


Wall time: 62.5 ms

Let's go further and retrieve some more data (the DataFrame is transposed via T for display reasons).


In [13]:
%%time
commits['email'] = commits['raw'].apply(lambda x: x.author.email)
commits['committed_date'] = commits['raw'].apply(lambda x: pd.to_datetime(x.committed_datetime))
commits['message'] = commits['raw'].apply(lambda x: x.message)
commits['sha'] = commits['raw'].apply(lambda x: str(x))
commits.head(2).T


Wall time: 78.1 ms

Dead easy and reasonably fast! But what about the modified files? Let's challenge our computer a little bit more by extracting the statistics data for every commit. The Stats object contains all the touched files per commit, including the number of lines that were inserted or deleted.

Additionally, we need some tricks to get the data into the shape we need. I'll guide you through this approach step by step. The main idea is to retrieve the real statistics data (not only the object references) and temporarily store it as a Pandas Series. Then we take another round to transform this data for use in our DataFrame.

Cracking the stats object's file statistics

This step is a little bit tricky and was found only by a good amount of trial and error. But it works in the end, as we will see. The goal is to unpack the information in the stats object into nice columns of our DataFrame via the Series#apply method. I'll show you step by step how this works in principle (although it works a little differently when using the apply approach).

As seen above, we have access to every file modification of each commit. In the end, it's a dictionary with the filename as the key and a dictionary of the change attributes as the value.


In [14]:
some_commit = commits.loc[56, 'raw']
some_commit.stats.files


Out[14]:
{'src/main/webapp/WEB-INF/tags/menu.tag': {'deletions': 2,
  'insertions': 2,
  'lines': 4}}

We extract the dictionary of dictionaries in two steps. Keep in mind that all this tricky data transformation depends heavily on the right index. But first things first.

First, the outer dictionary: we create a Series from the dictionary.


In [15]:
dict_as_series = pd.Series(some_commit.stats.files)
dict_as_series


Out[15]:
src/main/webapp/WEB-INF/tags/menu.tag    {'insertions': 2, 'deletions': 2, 'lines': 4}
dtype: object

Second, we wrap that series into a DataFrame (for index reasons):


In [16]:
dict_as_series_wrapped_in_dataframe = pd.DataFrame(dict_as_series)
dict_as_series_wrapped_in_dataframe


Out[16]:
0
src/main/webapp/WEB-INF/tags/menu.tag {'insertions': 2, 'deletions': 2, 'lines': 4}

After that, some magic occurs. We stack the DataFrame, meaning that we move our columns into the index, which becomes a MultiIndex.


In [17]:
stacked_dataframe = dict_as_series_wrapped_in_dataframe.stack()
stacked_dataframe


Out[17]:
src/main/webapp/WEB-INF/tags/menu.tag  0    {'insertions': 2, 'deletions': 2, 'lines': 4}
dtype: object

In [18]:
stacked_dataframe.index


Out[18]:
MultiIndex(levels=[['src/main/webapp/WEB-INF/tags/menu.tag'], [0]],
           labels=[[0], [0]])

With some manipulation of the index, we achieve what we need: an expansion of the rows for each file in a commit.


In [19]:
stacked_dataframe.reset_index().set_index('level_1')


Out[19]:
level_0 0
level_1
0 src/main/webapp/WEB-INF/tags/menu.tag {'insertions': 2, 'deletions': 2, 'lines': 4}

With this (dirty?) trick, all files from the stats object can be assigned to the original index of our DataFrame.

In the context of a call with the apply method, the command looks a little bit different, but in the end, the result is the same (I took a commit with multiple modified files from the DataFrame just to show the transformation a little better):


In [20]:
pd.DataFrame(commits[64:65]['raw'].apply(
    lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)


Out[20]:
level_1 0
64 readme.md {'insertions': 2, 'deletions': 2, 'lines': 4}
64 src/main/java/org/springframework/samples/petc... {'insertions': 111, 'deletions': 0, 'lines': 111}
64 src/main/webapp/WEB-INF/web.xml {'insertions': 0, 'deletions': 118, 'lines': 118}

In [21]:
%%time
stats = pd.DataFrame(commits['raw'].apply(
    lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
stats = stats.rename(columns={ 'level_1' : 'filename', 0 : 'stats_modifications'})
stats.head()


Wall time: 23.9 s

Unfortunately, this takes almost 30 seconds on my machine :-( (Help needed! Maybe there is a better way of doing this.)
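One possibly faster alternative (an untested sketch, not what the rest of this notebook uses) would be to collect all file statistics with a plain loop and build the DataFrame in a single step, avoiding the apply/stack machinery. Most of the time is probably spent in the git calls behind stats anyway, so your mileage may vary:

# collect one row per (commit index, filename) pair in a single pass
rows = []
for index, commit in commits['raw'].items():
    for filename, modifications in commit.stats.files.items():
        rows.append((index, filename, modifications))
stats_alt = pd.DataFrame(rows, columns=['index', 'filename', 'stats_modifications'])
stats_alt = stats_alt.set_index('index')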

Next, we extract the data from the stats_modifications column. We do this by simply wrapping each dictionary in a Series, which returns the data we need.


In [22]:
pd.Series(stats.loc[0, 'stats_modifications'])


Out[22]:
deletions     1
insertions    1
lines         2
dtype: int64

With an apply, it looks a little bit different because we are applying the lambda function along the DataFrame's index.

We get a warning because there seems to be a problem with the ordering of the index, but I haven't found any errors with this approach so far.


In [23]:
stats_modifications = stats['stats_modifications'].apply(lambda x: pd.Series(x))
stats_modifications.head(7)


Out[23]:
deletions insertions lines
0 1 1 2
1 0 1 1
2 0 10 10
2 3 21 24
2 3 0 3
3 1 1 2
3 9 11 20
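Before joining, a quick spot check against the source dictionaries reassures us that the rows still line up (a small sanity check using the variables from above):

# the first extracted row must match its original stats dictionary
assert stats_modifications.iloc[0]['lines'] == stats.iloc[0]['stats_modifications']['lines']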

We join the newly created data with the existing one via the join method.


In [24]:
stats = stats.join(stats_modifications)
stats.head()


Out[24]:
filename stats_modifications deletions insertions lines
0 readme.md {'insertions': 1, 'deletions': 1, 'lines': 2} 1 1 2
1 pom.xml {'insertions': 1, 'deletions': 0, 'lines': 1} 0 1 1
2 pom.xml {'insertions': 10, 'deletions': 0, 'lines': 10} 0 10 10
2 pom.xml {'insertions': 10, 'deletions': 0, 'lines': 10} 3 21 24
2 pom.xml {'insertions': 10, 'deletions': 0, 'lines': 10} 3 0 3

After we get rid of the now obsolete stats_modifications column...


In [25]:
del(stats['stats_modifications'])
stats.head()


Out[25]:
filename deletions insertions lines
0 readme.md 1 1 2
1 pom.xml 0 1 1
2 pom.xml 0 10 10
2 pom.xml 3 21 24
2 pom.xml 3 0 3

...we join the existing DataFrame with the stats information (transposed for display reasons)...


In [26]:
commits = commits.join(stats)
commits.head(2).T


Out[26]:
0 1
raw ffa967c94b65a70ea6d3b44275632821838d9fd3 fd1c742d4f8d193eb935519909c15302b783cd52
author Antoine Rey Antoine Rey
email antoine.rey@gmail.com antoine.rey@gmail.com
committed_date 2017-04-12 21:41:00+02:00 2017-03-06 08:12:14+00:00
message spring-petclinic-angular1 repo renamed to spri... Do not fail maven build when git directing is ...
sha ffa967c94b65a70ea6d3b44275632821838d9fd3 fd1c742d4f8d193eb935519909c15302b783cd52
filename readme.md pom.xml
deletions 1 0
insertions 1 1
lines 2 1

...and come to an end by deleting the raw data column, too (again transposed for display reasons).


In [27]:
del(commits['raw'])
commits.head(2).T


Out[27]:
0 1
author Antoine Rey Antoine Rey
email antoine.rey@gmail.com antoine.rey@gmail.com
committed_date 2017-04-12 21:41:00+02:00 2017-03-06 08:12:14+00:00
message spring-petclinic-angular1 repo renamed to spri... Do not fail maven build when git directing is ...
sha ffa967c94b65a70ea6d3b44275632821838d9fd3 fd1c742d4f8d193eb935519909c15302b783cd52
filename readme.md pom.xml
deletions 1 0
insertions 1 1
lines 2 1

So we're finished! A DataFrame that contains all the repository information needed for further analysis!


In [28]:
commits.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2228366 entries, 0 to 557
Data columns (total 9 columns):
author            object
email             object
committed_date    object
message           object
sha               object
filename          object
deletions         float64
insertions        float64
lines             float64
dtypes: float64(3), object(6)
memory usage: 170.0+ MB

In the end, we still have our commits from the beginning, but now enriched with all the information we can work with in another notebook.


In [29]:
len(commits.index.unique())


Out[29]:
558

Store for later usage

For now, we just store the DataFrame in HDF5 format with compression for later usage (we get a warning because of the string objects we're using, but that's no problem AFAIK).


In [30]:
commits.to_hdf("data/commits.h5", 'commits', mode='w', complevel=9, complib='zlib')


C:\dev\Anaconda3\lib\site-packages\pandas\core\generic.py:1101: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['author', 'email', 'committed_date', 'message', 'sha', 'filename']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
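Reading the DataFrame back in another notebook is then a one-liner:

commits = pd.read_hdf("data/commits.h5", 'commits')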

All in one code block

This notebook is really long because it includes a lot of explanations. But if you just need the code to extract the data from a Git repository, here it is:


In [31]:
import pandas as pd
import git

repo = git.Repo(r'C:\dev\repos\spring-petclinic', odbt=git.GitCmdObjectDB)

commits = pd.DataFrame(repo.iter_commits('master'), columns=['raw'])
commits['author'] = commits['raw'].apply(lambda x: x.author.name)
commits['email'] = commits['raw'].apply(lambda x: x.author.email)
commits['committed_date'] = commits['raw'].apply(lambda x: pd.to_datetime(x.committed_datetime))
commits['message'] = commits['raw'].apply(lambda x: x.message)
commits['sha'] = commits['raw'].apply(lambda x: str(x))

stats = pd.DataFrame(commits['raw'].apply(lambda x: pd.Series(x.stats.files)).stack()).reset_index(level=1)
stats = stats.rename(columns={ 'level_1' : 'filename', 0 : 'stats_modifications'})
stats_modifications = stats['stats_modifications'].apply(lambda x: pd.Series(x))
stats = stats.join(stats_modifications)
del(stats['stats_modifications'])

commits = commits.join(stats)
del(commits['raw'])

commits.to_hdf("data/commits.h5", 'commits', mode='w', complevel=9, complib='zlib')


C:\dev\Anaconda3\lib\site-packages\pandas\core\generic.py:1101: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['author', 'email', 'committed_date', 'message', 'sha', 'filename']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

Summary

I hope you aren't demotivated now by my Pandas approach for extracting data from Git repositories. Agreed, the stats object is a little unconventional to work with (and there may be better ways of doing it), but I think, in the end, the result is pretty useful.